Abstract

Statistical models for genetic linkage analysis of $k$-locus diseases are $k$-dimensional
subvarieties of a $(3^k - 1)$-dimensional probability simplex. We determine the algebraic
invariants of these models with general characteristics for $k = 1$; in particular,
we recover, and generalize, the Hardy-Weinberg curve. For $k = 2$, the algebraic
invariants are presented as determinants of $32 \times 32$-matrices of linear forms in 9
unknowns, a suitable format for computations with numerical data.



1 Introduction 

Most common diseases have a genetic component. The first step towards un- 
derstanding a genetic disease is to identify the genes that play a role in the 
disease etiology. Genes are identified by their location within the genome. Genetic
linkage analysis, or gene mapping [5,8,9,10], is concerned with this problem
of finding the chromosomal location of disease genes. Over 1,200 disease
genes have been successfully mapped [2], and this has led to a much better
understanding of Mendelian (one gene) disorders. Most common diseases are,
however, not caused by one gene but by $k \geq 2$ genes. The challenge today is
to understand complex diseases (such as cancer, heart disease and diabetes) 
which are caused by many interacting genes and environmental factors. 

The human genome has approximately 25,000 genes. Genes encode for pro- 
teins, and proteins perform all the cellular functions vital to life. We all have 
the same set of genes, but there are many variants of each gene, called alleles. 
Usually these variants all produce a functional protein, but a mutation in a 
gene can change the protein product of the gene, and this may result in disease.
Since mutations are rare, two affected siblings who have the same genetic
disease probably inherited the same mutation from a parent. Genetic linkage
analysis makes use of this fact: one tries to locate disease genes by identifying
regions in the genome that display statistically significant increased sharing
across a sample of affected relatives, such as sibling pairs [6].

The statistical models used in genetic linkage analysis are algebraic varieties.
The given data are $k$-dimensional tables of format $3 \times 3 \times \cdots \times 3$. As usual
in algebraic statistics ([7], [11], [13, §7]), there is one model coordinate $z_{i_1 i_2 \cdots i_k}$
for each cell entry, where $i_1, i_2, \ldots, i_k \in \{0, 1, 2\}$. This coordinate represents
the probability that for an affected sibling pair the IBD sharing (see Section 2)
at the first locus is $i_1$, the IBD sharing at the second locus is $i_2$, etc.
The model is a subvariety of the probability simplex with these coordinates.
It is $k$-dimensional, because the $z_{i_1 i_2 \cdots i_k}$ are given as polynomials in $k$ model
parameters $p_1, p_2, \ldots, p_k$. Here $p_j$ represents the frequency of the disease allele
at the $j$-th locus. We consider an infinite family of models which depends
polynomially on $3^k$ model characteristics $f_{i_1 i_2 \cdots i_k}$. The characteristic $f_{i_1 i_2 \cdots i_k}$
represents the probability that an individual who has $i_j$ copies of the disease
gene at the $j$-th locus will get affected. Note that the parameters $p_j$ and the
characteristics $f_{i_1 i_2 \cdots i_k}$ are unknown, but we might be interested in estimating
them from the given data $z$.

This paper is organized as follows. Section 2 contains a self-contained derivation
of the models in the one-locus case ($k = 1$). Here the models are curves in
a triangle with coordinates $(z_0, z_1, z_2)$. For general characteristics $(f_0, f_1, f_2)$,
the curve has degree four. In Section 3 we compute its defining polynomial,
a big expression in $z_0, z_1, z_2, f_0, f_1, f_2$. This is done by elimination using the
univariate Bezout resultant. We discuss what happens for special choices of
characteristics which have been studied in the genetics literature.

In Section 4 we derive the parametrization of the linkage models for $k \geq 2$. In
the two-locus case ($k = 2$), the models are surfaces in the space of nonnegative
$3 \times 3$-tables $(z_{ij})$ whose entries sum to one. For general characteristics $(f_{ij})$,
the surface has degree 32. In Section 5 we apply Chow forms to derive a
system of algebraic invariants. These are the polynomials which cut out the
surface. Each invariant is presented as the determinant of a $32 \times 32$-matrix
whose entries are linear forms in the $z_{ij}$ whose coefficients depend on the $f_{ij}$.
We argue that this format is suitable for statistical analysis with numerical
data. Computational issues and further directions are discussed in Section 6.



2 Derivation of the One-Locus Model 

The genetic code, the blueprint of life, is stored in our genome. The genome 
is arranged into chromosomes which can be thought of as linear arrays of 
genes. The human genome has two copies of each chromosome, with 23 pairs 
of chromosomes, 22 autosomes and the sex chromosomes X and Y (women
have XX and men XY). Each parent passes one copy of each chromosome to a
child. A chromosome passed from parent to child is a mosaic of the two copies 
of the parent, and a point at which the origin of a chromosome changes is 
called a recombination. This is illustrated in Figure 1. 

Between any two recombination sites, the inheritance pattern of the two siblings
is constant and is encoded by the inheritance vector $x = (x_{11}, x_{12}, x_{21}, x_{22})$.
The entry $x_{kj}$ is the label of the chromosome segment that sibling $k$ got from
parent $j$. If we label the paternal chromosomes with 1 and 2 and the maternal
chromosomes with 3 and 4, then $x_{11}, x_{21} \in \{1, 2\}$ and $x_{12}, x_{22} \in \{3, 4\}$, so
there are 16 possible inheritance vectors $x$. They come in three classes:

\[
\begin{aligned}
C_0 &= \{ (1,3,2,4),\ (1,4,2,3),\ (2,3,1,4),\ (2,4,1,3) \}, \\
C_1 &= \{ (1,3,1,4),\ (1,4,1,3),\ (2,3,2,4),\ (2,4,2,3), \\
    &\qquad (1,3,2,3),\ (2,3,1,3),\ (1,4,2,4),\ (2,4,1,4) \}, \\
C_2 &= \{ (1,3,1,3),\ (1,4,1,4),\ (2,3,2,3),\ (2,4,2,4) \}.
\end{aligned}
\]



We say that two siblings share genetic material, at a locus, identical by descent
(IBD) if it originated from the same parental chromosome copy. The IBD sharing
at a locus can be 0, 1 or 2, where the inheritance vectors in $C_i$ correspond to
IBD sharing of $i$. Since at a random locus in the genome each inheritance vector
is equally likely, the IBD sharing is 0, 1 or 2 with probabilities 1/4, 1/2 and 1/4.
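
The classification into $C_0$, $C_1$, $C_2$ and the null probabilities can be checked by brute force. The following is a minimal sketch; the function and variable names are illustrative choices, not taken from the paper.

```python
from fractions import Fraction
from itertools import product

# Inheritance vector x = (x11, x12, x21, x22): sibling k receives paternal
# segment x_k1 in {1, 2} and maternal segment x_k2 in {3, 4}.
vectors = list(product((1, 2), (3, 4), (1, 2), (3, 4)))

def ibd_sharing(x):
    """Number of parental segments that the two siblings have in common."""
    x11, x12, x21, x22 = x
    return int(x11 == x21) + int(x12 == x22)

# classes[i] plays the role of C_i in the text.
classes = {i: [x for x in vectors if ibd_sharing(x) == i] for i in (0, 1, 2)}

# At an unlinked locus all 16 vectors are equally likely.
null_distribution = [Fraction(len(classes[i]), len(vectors)) for i in (0, 1, 2)]
print(null_distribution)
```

The class sizes 4, 8, 4 out of 16 give the probabilities 1/4, 1/2 and 1/4 stated above.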



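[Figure 1: pedigree diagram. The paternal chromosomes are labeled 1 and 2, the maternal
chromosomes 3 and 4. Along the chromosome pair, the inheritance vector of the sibling
pair changes at each recombination:]

Inh. vector   Sharing
(1,4,2,3)     0
(2,4,2,3)     1
(2,4,2,4)     2
(2,4,1,4)     1
(1,4,1,4)     2
(1,3,1,4)     1
(1,3,1,4)     1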



Fig. 1. An example of the inheritance of one chromosome pair in parents and a 
sibling pair. Squares represent males and circles females. 



Each individual has two alleles, i.e. two copies of every gene, one on each 
chromosome. A genotype at a locus is the unordered pair of alleles. We are
only concerned with whether one carries an allele that predisposes to disease, 
which we call d, or a normal allele, called n. The set of possible genotypes at 
a disease locus is G = {nn, nd, dn, dd}. 

Let $p$ denote the frequency of the disease allele d in the population. This
quantity is our model parameter. We assume Hardy-Weinberg equilibrium:

\[
\Pr(nn) = (1-p)^2, \quad \Pr(nd) = p(1-p), \quad \Pr(dn) = p(1-p) \quad\text{and}\quad \Pr(dd) = p^2 .
\]

A disease model is specified by $f = (f_0, f_1, f_2)$, where $f_i$ is the probability that
an individual is affected with the disease, given $i$ copies of the disease allele,

\[
\begin{aligned}
f_0 &= \Pr(\text{affected} \mid nn), \qquad f_2 = \Pr(\text{affected} \mid dd), \\
f_1 &= \Pr(\text{affected} \mid nd) = \Pr(\text{affected} \mid dn).
\end{aligned}
\]

The quantities $f_i$ are known as penetrances in the genetics literature. In this
paper, we call them model characteristics to emphasize their algebraic role.

The coordinates of a disease model are $z = (z_0, z_1, z_2)$, where $z_i$ is the probability
that the IBD sharing for an affected sibling pair is $i$ at a given locus,

\[
z_i = \Pr(\text{IBD sharing} = i \mid \text{both sibs affected}), \qquad i = 0, 1, 2.
\]

Then, as was stated above, at a random locus not linked to the disease gene the
distribution is $z_{\mathrm{null}} = (1/4, 1/2, 1/4)$. Data for linkage analysis are collected
from a sample of $n$ sibling pairs (and parents) as follows. The marker information
is used to infer the IBD sharing at each marker locus for each sibling pair, and
at any particular locus one uses the vector $(n_0, n_1, n_2)$, where $n_i$ is the number
of sibling pairs whose inferred IBD sharing is $i$ at the locus. Each such data
point determines an empirical distribution

\[
z = (z_0, z_1, z_2) = (n_0/n,\ n_1/n,\ n_2/n), \qquad \text{where } n_0 + n_1 + n_2 = n.
\]

The objective is to look for regions in the genome where $z$ deviates significantly
from $z_{\mathrm{null}} = (1/4, 1/2, 1/4)$. Such regions may be linked to the disease.

The one-locus model is given by expressing the coordinates $(z_0, z_1, z_2)$ as polynomial
functions of the parameter $p$ and the characteristics $f_0, f_1, f_2$. These
polynomials are derived as follows. Consider the set of events $\mathcal{E}_i = C_i \times G \times G$
for $i = 0, 1, 2$. Each event in $\mathcal{E}_i$ consists of an inheritance vector, a genotype
for the mother and a genotype for the father. This triple determines the total
number $m$ of disease alleles carried by the parents and the numbers $k_1$ and $k_2$
of disease alleles carried by the two siblings. Up to a constant factor, the probability
that the event occurs and both siblings are affected is

\[
f_{k_1} f_{k_2}\, p^m q^{4-m}, \qquad \text{where } q = 1 - p.
\]

Then, up to a global normalizing constant, the IBD sharing probability $z_i$ is
the sum over all events in $\mathcal{E}_i$ of the monomials $f_{k_1} f_{k_2} p^m q^{4-m}$. Hence $z_0$ is a
sum of $|\mathcal{E}_0| = 64$ monomials, $z_1$ is a sum of 128 monomials, and $z_2$ is a sum of
64 monomials. But these monomials are not all distinct. For instance, all four
elements of $C_0 \times \{nn\} \times \{nn\} \subset \mathcal{E}_0$ contribute the same monomial $f_0^2 q^4$ to
$z_0$. By explicitly listing all events in $\mathcal{E}_0$, $\mathcal{E}_1$ and $\mathcal{E}_2$, we get the following result.
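
The explicit listing of events is easy to mechanize. The sketch below is one possible way to do it and is not the authors' code; the names `table`, `copies` and `sharing` are hypothetical. It enumerates all 256 triples (inheritance vector, father's genotype, mother's genotype) and collects, for each IBD class $i$ and each parental allele count $m$, how often each monomial $f_{k_1} f_{k_2}$ occurs.

```python
from collections import Counter
from itertools import product

GENOTYPES = list(product("nd", repeat=2))            # ordered pairs nn, nd, dn, dd
VECTORS = list(product((1, 2), (3, 4), (1, 2), (3, 4)))

def sharing(x):
    """IBD sharing encoded by the inheritance vector x."""
    return int(x[0] == x[2]) + int(x[1] == x[3])

def copies(father, mother, paternal_label, maternal_label):
    """Disease alleles of a child who inherits the given parental segments."""
    # Labels 1, 2 index the father's two alleles; labels 3, 4 the mother's.
    return int(father[paternal_label - 1] == "d") + int(mother[maternal_label - 3] == "d")

# table[i][m] maps an unordered pair (k1, k2) to the number of events in
# E_i = C_i x G x G with m parental disease alleles and monomial f_k1 * f_k2.
table = {i: {m: Counter() for m in range(5)} for i in range(3)}
for x, father, mother in product(VECTORS, GENOTYPES, GENOTYPES):
    m = (father + mother).count("d")
    k1 = copies(father, mother, x[0], x[1])
    k2 = copies(father, mother, x[2], x[3])
    table[sharing(x)][m][tuple(sorted((k1, k2)))] += 1

for i in range(3):
    print(i, [dict(table[i][m]) for m in range(5)])
```

For example, `table[1][1]` records $8 f_0^2 + 16 f_0 f_1 + 8 f_1^2$, the coefficient of $p q^3$ in $z_1$.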

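Proposition 1 The coordinates $z_i$ of the one-locus model are homogeneous
polynomials of bidegree (2,4) in the characteristics $(f_0, f_1, f_2)$ and the parameters
$(p, q)$. The column vector $(z_0, z_1, z_2)^T$ equals the matrix-vector product

\[
\begin{pmatrix}
4f_0^2 & 16 f_0 f_1 & 8 f_0 f_2 + 16 f_1^2 & 16 f_1 f_2 & 4 f_2^2 \\
8f_0^2 & 8(f_0^2 + 2 f_0 f_1 + f_1^2) & 16(f_0 f_1 + f_1^2 + f_1 f_2) & 8(f_1^2 + 2 f_1 f_2 + f_2^2) & 8 f_2^2 \\
4f_0^2 & 8 f_0^2 + 8 f_1^2 & 4 f_0^2 + 16 f_1^2 + 4 f_2^2 & 8 f_1^2 + 8 f_2^2 & 4 f_2^2
\end{pmatrix}
\cdot
\begin{pmatrix} q^4 \\ p q^3 \\ p^2 q^2 \\ p^3 q \\ p^4 \end{pmatrix}.
\]

Proposition 1 says that the one-locus model has the form

\[
(z_0, z_1, z_2)^T = F \cdot (q^4,\ p q^3,\ p^2 q^2,\ p^3 q,\ p^4)^T, \tag{1}
\]

where $F$ is a $3 \times 5$-matrix whose entries are quadratic polynomials in the
penetrances $f_i$. The resultant computation to be described in the next section
works for any model of this form, even if the matrix $F$ were more complicated.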



3 Curves in a Triangle 



Suppose that we fix the model characteristics $f_0, f_1, f_2$ and hence the matrix $F$.
Then (1) defines a curve in the projective plane with coordinates $(z_0 : z_1 : z_2)$.
The positive part of the projective plane is identified with the triangle

\[
\{\, (z_0, z_1, z_2) \ :\ z_0, z_1, z_2 \geq 0 \ \text{ and } \ z_0 + z_1 + z_2 = 1 \,\}. \tag{2}
\]

The one-locus model with characteristics $f_0, f_1, f_2$ is the intersection of the
curve with the triangle. We are interested in its defining polynomial.

Proposition 2 For general characteristics $f_0, f_1, f_2$, the one-locus model is a
plane curve of degree four. The defining polynomial of this curve equals

\[
\begin{aligned}
I(z_0, z_1, z_2) \ =\ & a_1 z_0^3 z_2 + a_2 z_0^2 z_1^2 + a_3 z_0^2 z_1 z_2 + a_4 z_0^2 z_2^2 + a_5 z_0 z_1^3 \\
& + a_6 z_0 z_1^2 z_2 + a_7 z_0 z_1 z_2^2 + a_8 z_0 z_2^3 + a_9 z_1^4 \\
& + a_{10} z_1^3 z_2 + a_{11} z_1^2 z_2^2 + a_{12} z_1 z_2^3 + a_{13} z_2^4,
\end{aligned}
\]

where each $a_i$ is a polynomial homogeneous of degree eight in $(f_0, f_1, f_2)$.

This proposition is proved by an explicit calculation. Namely, the invariant
$I(z_0, z_1, z_2)$ is gotten by eliminating $p$ and $q$ from the three equations in (1).
This is done using the Bezout resultant ([12, Theorem 2.2], [13, Theorem 4.3]).
Specifically, we are using the following $4 \times 4$-matrix from [12, Equation (1.5)]:

\[
B \ =\ \begin{pmatrix}
[12] & [13] & [14] & [15] \\
[13] & [14] + [23] & [15] + [24] & [25] \\
[14] & [15] + [24] & [25] + [34] & [35] \\
[15] & [25] & [35] & [45]
\end{pmatrix} \tag{3}
\]

The determinant of this matrix is the Chow form [3] of the curve in projective
4-space $P^4$ which is parameterized by the vector of monomials $(q^4, p q^3, p^2 q^2, p^3 q, p^4)$.
We are interested in the curve in the projective plane $P^2$ which is the image
of that monomial curve under the linear map from $P^4$ to $P^2$ given by the
matrix $F$. Section 2.2 in [3] explains how to compute the image under a linear
map of a variety that is presented by its Chow form. Applying the method
described there means replacing the bracket $[ij]$ by the $3 \times 3$-subdeterminant
with column indices $i$, $j$ and 6 in the matrix from Proposition 1 augmented
by $z$:



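\[
(F, z) \ =\ \begin{pmatrix}
4f_0^2 & 16 f_0 f_1 & 8 f_0 f_2 + 16 f_1^2 & 16 f_1 f_2 & 4 f_2^2 & z_0 \\
8f_0^2 & 8(f_0^2 + 2 f_0 f_1 + f_1^2) & 16(f_0 f_1 + f_1^2 + f_1 f_2) & 8(f_1^2 + 2 f_1 f_2 + f_2^2) & 8 f_2^2 & z_1 \\
4f_0^2 & 8 f_0^2 + 8 f_1^2 & 4 f_0^2 + 16 f_1^2 + 4 f_2^2 & 8 f_1^2 + 8 f_2^2 & 4 f_2^2 & z_2
\end{pmatrix}
\]

The desired algebraic invariant equals (up to a factor) the determinant of $B$:

\[
I(z_0, z_1, z_2) \ =\ 2^{-16}\, f_0^{-2}\, f_2^{-2}\, (f_0 - 2 f_1 + f_2)^{-4} \cdot \det(B). \tag{4}
\]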
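
The bracket substitution can be tried out numerically. The following sketch is an illustration under stated assumptions rather than the paper's implementation: it assumes the matrix $(F, z)$ as reconstructed from Proposition 1, uses exact rational arithmetic, and the helper names (`f_matrix`, `model_point`, `bezout_det`) are hypothetical. It builds the $4 \times 4$ Bezout matrix for fixed characteristics and evaluates its determinant at a point of the curve and at a point chosen away from it.

```python
from fractions import Fraction

def f_matrix(f0, f1, f2):
    """3 x 5 matrix F of the one-locus model (rows z0, z1, z2)."""
    return [
        [4*f0**2, 16*f0*f1, 8*f0*f2 + 16*f1**2, 16*f1*f2, 4*f2**2],
        [8*f0**2, 8*(f0 + f1)**2, 16*f1*(f0 + f1 + f2), 8*(f1 + f2)**2, 8*f2**2],
        [4*f0**2, 8*f0**2 + 8*f1**2, 4*f0**2 + 16*f1**2 + 4*f2**2,
         8*f1**2 + 8*f2**2, 4*f2**2],
    ]

def model_point(F, p):
    """Unnormalized coordinates (z0, z1, z2) = F . (q^4, pq^3, ..., p^4)."""
    q = 1 - p
    mono = [q**4, p*q**3, p**2*q**2, p**3*q, p**4]
    return [sum(c*m for c, m in zip(row, mono)) for row in F]

def det(M):
    """Determinant by Laplace expansion along the first row (fine for n <= 4)."""
    if len(M) == 1:
        return M[0][0]
    return sum((-1)**j * M[0][j] * det([r[:j] + r[j+1:] for r in M[1:]])
               for j in range(len(M)))

def bezout_det(F, z):
    """det(B) after replacing [ij] by the minor of (F, z) on columns i, j, 6."""
    aug = [row + [zi] for row, zi in zip(F, z)]      # the 3 x 6 matrix (F, z)
    def br(i, j):                                     # bracket [ij], 1-based
        return det([[r[i-1], r[j-1], r[5]] for r in aug])
    B = [
        [br(1, 2), br(1, 3),            br(1, 4),            br(1, 5)],
        [br(1, 3), br(1, 4) + br(2, 3), br(1, 5) + br(2, 4), br(2, 5)],
        [br(1, 4), br(1, 5) + br(2, 4), br(2, 5) + br(3, 4), br(3, 5)],
        [br(1, 5), br(2, 5),            br(3, 5),            br(4, 5)],
    ]
    return det(B)

# Arbitrary illustrative characteristics and allele frequency.
F = f_matrix(Fraction(1, 10), Fraction(1, 5), Fraction(9, 10))
on_curve = bezout_det(F, model_point(F, Fraction(1, 3)))
off_curve = bezout_det(F, [1, 1, 1])
print(on_curve, off_curve)
```

Because the arithmetic is exact, the determinant at a model point is exactly zero, which is the behavior the elimination argument predicts.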



If the characteristics $f_0, f_1, f_2$ are arbitrary real numbers between 0 and 1
then the polynomial $I(z_0, z_1, z_2)$ is irreducible of degree four and its zero set
is precisely the model. For some special choices of characteristics $f_i$, however,
the polynomial $I(z_0, z_1, z_2)$ may become reducible or it may vanish identically.
In the reducible case, the defining polynomial is one of the factors. Consider
the following special models which are commonly used in genetics:
\[
\begin{array}{lccc}
 & f_0 & f_1 & f_2 \\
\text{dominant :} & 0 & f & f \\
\text{additive :} & 0 & f/2 & f \\
\text{recessive :} & 0 & 0 & f
\end{array}
\]

Here $0 < f < 1$. For the dominant model our invariant specializes to

\[
I(z_0, z_1, z_2) \ =\ 4 f^8\, (z_1 - z_0 - z_2)\, \underline{(\, z_1^2 z_0 - 8 z_1 z_0 z_2 + 4 z_1 z_2^2 + 4 z_0^2 z_2 + 4 z_0 z_2^2 - 4 z_2^3 \,)},
\]

and the defining polynomial of the model is the underlined cubic factor.
For the additive model our invariant specializes to

\[
I(z_0, z_1, z_2) \ =\ \frac{f^8}{16}\, (\, z_1^2 + 2 z_1 z_2 - 8 z_0 z_2 + z_2^2 \,)\, \underline{(\, z_1 - z_0 - z_2 \,)}^{\,2},
\]

and the defining polynomial of the model is the underlined linear factor.
It can be shown that $I(z_0, z_1, z_2)$ vanishes identically if and only if

\[
f_0 = f_1 = 0 \quad\text{or}\quad f_1 = f_2 = 0 \quad\text{or}\quad f_0 = f_1 = f_2 .
\]

This includes the recessive model, which is the familiar Hardy-Weinberg curve:

\[
z_1^2 - 4 z_0 z_2 = 0.
\]
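
As a sanity check on the three special models, one can evaluate the parametrization (1) at a few rational parameter values and confirm that each stated defining polynomial vanishes. This is a minimal sketch assuming the matrix $F$ of Proposition 1; the function name `one_locus_point` and the test values are illustrative.

```python
from fractions import Fraction

def one_locus_point(f0, f1, f2, p):
    """Unnormalized (z0, z1, z2) of the one-locus model at allele frequency p."""
    q = 1 - p
    mono = [q**4, p*q**3, p**2*q**2, p**3*q, p**4]
    F = [
        [4*f0**2, 16*f0*f1, 8*f0*f2 + 16*f1**2, 16*f1*f2, 4*f2**2],
        [8*f0**2, 8*(f0 + f1)**2, 16*f1*(f0 + f1 + f2), 8*(f1 + f2)**2, 8*f2**2],
        [4*f0**2, 8*f0**2 + 8*f1**2, 4*f0**2 + 16*f1**2 + 4*f2**2,
         8*f1**2 + 8*f2**2, 4*f2**2],
    ]
    return [sum(c*m for c, m in zip(row, mono)) for row in F]

def dominant_cubic(z0, z1, z2):
    return (z1**2*z0 - 8*z1*z0*z2 + 4*z1*z2**2
            + 4*z0**2*z2 + 4*z0*z2**2 - 4*z2**3)

def additive_line(z0, z1, z2):
    return z1 - z0 - z2

def hardy_weinberg(z0, z1, z2):
    return z1**2 - 4*z0*z2

f = Fraction(3, 5)                      # arbitrary value with 0 < f < 1
checks = []
for p in (Fraction(1, 10), Fraction(1, 3), Fraction(3, 4)):
    checks.append(dominant_cubic(*one_locus_point(0, f, f, p)))
    checks.append(additive_line(*one_locus_point(0, f / 2, f, p)))
    checks.append(hardy_weinberg(*one_locus_point(0, 0, f, p)))
print(checks)
```

All three polynomials are homogeneous, so the missing normalization of $z$ does not affect whether they vanish.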



Fig. 2. Holmans' triangle. The larger triangle is the probability simplex,
$z_0 + z_1 + z_2 = 1$, and the smaller triangle is the possible triangle for sibling pair IBD
sharing probabilities. The curve from (1/4,1/2,1/4) to (0,0,1) is the Hardy-Weinberg
(recessive) curve. The curve from (1/4,1/2,1/4) to (0,1/2,1/2) is the dominant
curve and the line between the same points is the additive curve.
Holmans [8] showed that the IBD sharing probabilities for affected sibling pairs
must satisfy $2 z_0 \leq z_1 \leq z_0 + z_2$. This means we can restrict our attention to
the smaller triangle (Holmans' triangle) in Figure 2. We can graph the curve
in the triangle for any choice of model characteristics. The part of the curve
corresponding to values of $p \in [0, 1]$ is within the smaller triangle.
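
One way to see these inequalities directly from the parametrization (1), assuming the entries of the $3 \times 5$-matrix $F$ of Proposition 1, is to combine the rows of $F$ coefficient by coefficient. The identities below are a sketch added for illustration and are not Holmans' original argument:

```latex
\begin{align*}
z_1 - 2 z_0       &= 8\, p q\, \bigl( (f_0 - f_1)\, q + (f_1 - f_2)\, p \bigr)^2 , \\
z_0 + z_2 - z_1   &= 4\, (f_0 - 2 f_1 + f_2)^2\, p^2 q^2 .
\end{align*}
```

Both right-hand sides are nonnegative for real $p, q \geq 0$, and the normalizing constant is positive, so every model point with $p \in [0, 1]$ satisfies $2 z_0 \leq z_1 \leq z_0 + z_2$. The second identity also shows that the model lies on the line $z_1 = z_0 + z_2$ exactly when $f_0 - 2 f_1 + f_2 = 0$, which is the additive case.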

It is worth noting that not all points $(z_0, z_1, z_2)$ in Holmans' triangle which
satisfy the algebraic invariant are in the image of a point $(p, q)$ with real
coordinates. Consider e.g. the model with characteristics $f_0 = 1$, $f_1 = 0$ and
$f_2 = 1$ and complex parameters $(p, q)$. The real part of the curve corresponding
to this model is shown in Figure 3. Two segments of the curve are within
Holmans' triangle, one of which (dotted) corresponds to values $p \in [0, 1]$. The
other segment has a complex pre-image.




Fig. 3. Holmans' triangle. The larger triangle is the probability simplex,
$z_0 + z_1 + z_2 = 1$, and the smaller triangle is the possible triangle for sibling pair
IBD sharing probabilities. The curve corresponds to a model with characteristics
$f_0 = 1$, $f_1 = 0$ and $f_2 = 1$. The dotted part of the curve is the image of real valued
$p$, and the solid part is the image of $p = 1/2 + y\sqrt{-1}$, for a real number $y$.

We expressed the IBD sharing of the sibling pair at a gene locus (the model
coordinate $z$) as a function of $f_0, f_1, f_2$ and $p$. In practice, however, we get
data at marker loci, regularly spaced across the chromosomes, not at the gene
locus. If there has been no recombination between the gene locus and a marker
locus then the IBD sharing at the two loci is the same, but it may differ if there
has been a recombination in either sibling. Let $\theta$ be the recombination fraction
between the gene locus and the marker locus. The new parameter $\theta$ depends
on the distance between the two loci. Following [5], we can express the IBD
sharing probabilities at a marker locus distance $\theta$ away from the gene by the
formula

\[
(z_0, z_1, z_2)^T = F_\theta \cdot (q^4,\ p q^3,\ p^2 q^2,\ p^3 q,\ p^4)^T, \tag{5}
\]

where $F_\theta$ is the product of the $3 \times 3$-matrix

\[
\begin{pmatrix}
\psi^2 & \psi \bar\psi & \bar\psi^2 \\
2 \psi \bar\psi & \psi^2 + \bar\psi^2 & 2 \psi \bar\psi \\
\bar\psi^2 & \psi \bar\psi & \psi^2
\end{pmatrix}
\]

with the matrix $F$ in (1), with $\psi = \theta^2 + (1 - \theta)^2$ and $\bar\psi = 1 - \psi$.

One can easily repeat the resultant calculation in Proposition 2 to obtain
the equation of the larger family of curves defined by (5). Note that $\theta = 0$
corresponds to the earlier case, and increasing $\theta$ shifts the curve towards $z_{\mathrm{null}}$.
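
The effect of the recombination fraction can be illustrated with a few lines of code. The sketch below assumes the standard form of the $3 \times 3$ IBD transition matrix in terms of $\psi = \theta^2 + (1-\theta)^2$; the function names are illustrative and not part of the paper.

```python
from fractions import Fraction

def marker_transition(theta):
    """3 x 3 matrix taking IBD probabilities at the gene to those at a marker."""
    psi = theta**2 + (1 - theta)**2
    bar = 1 - psi
    return [
        [psi**2,      psi * bar,        bar**2],
        [2*psi*bar,   psi**2 + bar**2,  2*psi*bar],
        [bar**2,      psi * bar,        psi**2],
    ]

def apply(T, z):
    """Matrix-vector product T . z."""
    return [sum(t * zi for t, zi in zip(row, z)) for row in T]

z_gene = [Fraction(1, 10), Fraction(2, 5), Fraction(1, 2)]   # hypothetical point
at_gene = apply(marker_transition(Fraction(0)), z_gene)       # theta = 0
unlinked = apply(marker_transition(Fraction(1, 2)), z_gene)   # theta = 1/2
print(at_gene, unlinked)
```

With $\theta = 0$ the matrix is the identity, and with $\theta = 1/2$ every distribution is sent to $(1/4, 1/2, 1/4)$, in line with the remark that increasing $\theta$ moves the curve towards the null point.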



We close this section with a statistical discussion. We wish to find the gene
locus using the inferred IBD sharing at the marker loci. Since $\theta$ can be thought
of as a measure of the distance between the marker locus and the gene locus,
we wish to estimate $\theta$ at each marker locus. The inferred IBD sharing can be
used to obtain an estimate of the model coordinates $z$. If $p$, $f_0$, $f_1$ and $f_2$ are
known it is then easy to estimate $\theta$. However, that is rarely the case, and it is
impossible to identify all of the unknown quantities $p$, $f_0$, $f_1$, $f_2$ and $\theta$ from the
coordinates $z$. Instead the model (1) is applied to biological data as follows.
The IBD sharing at the gene locus (and at nearby marker loci) is largest when
the disease allele has a strong effect and/or the disease allele is rare, i.e. when
$f_0 \leq f_1 \leq f_2$ (and preferably $f_0 \ll f_2$), and $p$ is small. In these, biologically
interesting, situations the data point $z$ is clearly different from $z_{\mathrm{null}}$. So in
practice a test for genetic linkage tests whether $z$ is significantly different
from $z_{\mathrm{null}}$. A widely used test statistic for linkage is $S_{\mathrm{pairs}} = z_2 + z_1/2$, which
measures deviations from $z_{\mathrm{null}}$ along the line corresponding to the additive
model.
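
To make the statistic concrete, the following sketch computes the empirical distribution and $S_{\mathrm{pairs}}$ from IBD counts. The counts used here are hypothetical and only illustrate the arithmetic; no significance test is implied.

```python
from fractions import Fraction

Z_NULL = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]

def empirical_distribution(n0, n1, n2):
    """Empirical z = (n0/n, n1/n, n2/n) from sibling-pair IBD counts."""
    n = n0 + n1 + n2
    return [Fraction(c, n) for c in (n0, n1, n2)]

def s_pairs(z):
    """The statistic S_pairs = z2 + z1/2."""
    return z[2] + z[1] / 2

z_hat = empirical_distribution(20, 50, 30)   # hypothetical counts, n = 100
print(s_pairs(z_hat), s_pairs(Z_NULL))
```

Under the null distribution the statistic equals 1/2, so values above 1/2 indicate excess sharing.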



4 Derivation of the Two-Locus Model 



Many common genetic disorders are caused by not one but many interacting 
genes. We now consider the two-locus model, k = 2, where we assume that 
two genes cause the disease, independently or together. We shall assume that 
the genes are unlinked, i.e., they are either on different chromosomes or far 
apart on the same chromosome. The derivation is much like in Section 2. 

The model parameters are $p_1$ and $p_2$, where $p_i$ is the frequency of the disease
allele at the $i$th locus. A two-locus genotype is an element in
$G \times G = \{nn, nd, dn, dd\}^2$. The model characteristics are $f = (f_{00}, f_{01}, \ldots, f_{22})$, where
$f_{ij}$ is the probability that an individual is affected with the disease, given $i$
copies of the first disease allele and $j$ copies of the second disease allele.

The model coordinates are $z = (z_{00}, z_{01}, z_{02}, z_{10}, z_{11}, z_{12}, z_{20}, z_{21}, z_{22})$, where $z_{ij}$
represents the probability for an affected sibling pair that the IBD sharing at
the first gene locus is $i$, and $j$ at the second gene locus:

\[
z_{ij} = \Pr(\text{IBD sharing} = (i, j) \mid \text{both sibs affected}), \qquad i, j = 0, 1, 2.
\]

The IBD sharing at two random loci, neither of which is linked to the disease
genes, is the null hypothesis $z_{\mathrm{null}} = (1/16, 1/8, 1/16, 1/8, 1/4, 1/8, 1/16, 1/8, 1/16)$.

The polynomial functions which express the coordinates $z_{ij}$ in terms of $p_1, p_2$
and the $f_{ij}$ are derived as follows. We consider the set of events

\[
\mathcal{E}_i \times \mathcal{E}_j \ =\ C_i \times G \times G \times C_j \times G \times G \qquad \text{for } i, j = 0, 1, 2.
\]

Each event in $\mathcal{E}_i \times \mathcal{E}_j$ consists of an inheritance vector, the genotype of the
father and the genotype of the mother, at each locus. For a given event we
know the total numbers $m_1$ and $m_2$ of disease alleles carried by the parents
at the first and second locus and $k_{11}, k_{12}, k_{21}, k_{22}$, where $k_{ij}$ is the number of
disease alleles carried by sibling $i$ at locus $j$. The probability of the event is

\[
f_{k_{11} k_{12}}\, f_{k_{21} k_{22}}\; p_1^{m_1} q_1^{4 - m_1}\, p_2^{m_2} q_2^{4 - m_2}, \qquad \text{where } q_1 = 1 - p_1 \text{ and } q_2 = 1 - p_2 .
\]

Up to a normalizing constant, each IBD sharing probability $z_{ij}$ is the sum of
the monomials $f_{k_{11} k_{12}} f_{k_{21} k_{22}}\, p_1^{m_1} q_1^{4 - m_1} p_2^{m_2} q_2^{4 - m_2}$ over all events in $\mathcal{E}_i \times \mathcal{E}_j$.

Proposition 3 The coordinates $z_{ij}$ of the two-locus model are homogeneous
polynomials of tridegree (2,4,4) in the characteristics $(f_{ij})$, the parameters
$(p_1, q_1)$ at the first locus, and the parameters $(p_2, q_2)$ at the second locus.

The matrix form of the one-locus model given in Proposition 1 immediately
generalizes to the two-locus model. Let $\pi$ denote the column vector whose
entries are the 25 monomials of bidegree (4, 4) listed in lexicographic order:

\[
\pi \ :=\ (\, q_1^4 q_2^4,\ q_1^4 p_2 q_2^3,\ q_1^4 p_2^2 q_2^2,\ \ldots,\ p_1 q_1^3 q_2^4,\ p_1 q_1^3 p_2 q_2^3,\ \ldots,\ p_1^4 p_2^4 \,).
\]

Corollary 4 The two-locus model has the form $z^T = F \cdot \pi$ where $F$ is a
$9 \times 25$-matrix whose entries are quadratic forms in the characteristics $f_{ij}$.

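A typical entry in our $9 \times 25$ matrix $F$ looks like

\[
32 \cdot (\, f_{00}^2 + 2 f_{00} f_{10} + 4 f_{01}^2 + 8 f_{01} f_{11} + f_{02}^2 + 2 f_{02} f_{12} + f_{10}^2 + 4 f_{11}^2 + f_{12}^2 \,). \tag{$*$}
\]

This quadratic form appears in $F$ in row 6 and column 8. It is the coefficient of
the 8th biquartic monomial $p_1 q_1^3 p_2^2 q_2^2$ in the expression for the 6th coordinate:

\[
\begin{aligned}
z_{12} \ =\ & (32 f_{00}^2) \cdot q_1^4 q_2^4 + (64 f_{00}^2 + 64 f_{01}^2) \cdot q_1^4 p_2 q_2^3 \\
& + (32 f_{00}^2 + 128 f_{01}^2 + 32 f_{02}^2) \cdot q_1^4 p_2^2 q_2^2 + \cdots \\
& + (*) \cdot p_1 q_1^3 p_2^2 q_2^2 + \cdots + (64 f_{21}^2 + 64 f_{22}^2) \cdot p_1^4 p_2^3 q_2 + (32 f_{22}^2) \cdot p_1^4 p_2^4 .
\end{aligned}
\]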
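
Entries of the $9 \times 25$ matrix can be generated by combining the one-locus event counts at the two loci, since every two-locus event is a pair of one-locus events. The sketch below is one way to do this and is not the authors' Maple code; `one_locus_events` and `two_locus_entry` are hypothetical helper names. It recomputes the entry in row 6, column 8.

```python
from collections import Counter
from itertools import product

GENOTYPES = list(product("nd", repeat=2))
VECTORS = list(product((1, 2), (3, 4), (1, 2), (3, 4)))

def one_locus_events():
    """Counter keyed by (IBD sharing, m, k1, k2), siblings kept in order."""
    events = Counter()
    for x, fa, mo in product(VECTORS, GENOTYPES, GENOTYPES):
        i = int(x[0] == x[2]) + int(x[1] == x[3])
        m = (fa + mo).count("d")
        k1 = int(fa[x[0] - 1] == "d") + int(mo[x[1] - 3] == "d")
        k2 = int(fa[x[2] - 1] == "d") + int(mo[x[3] - 3] == "d")
        events[(i, m, k1, k2)] += 1
    return events

def two_locus_entry(i, j, m1, m2):
    """Coefficient of p1^m1 q1^(4-m1) p2^m2 q2^(4-m2) in z_ij.

    Keys are unordered pairs ((k11, k12), (k21, k22)) that stand for the
    monomial f_{k11 k12} * f_{k21 k22}; values are integer coefficients.
    """
    ev = one_locus_events()
    first = [(k, c) for k, c in ev.items() if k[:2] == (i, m1)]
    second = [(k, c) for k, c in ev.items() if k[:2] == (j, m2)]
    entry = Counter()
    for (_, _, k11, k21), c1 in first:
        for (_, _, k12, k22), c2 in second:
            entry[tuple(sorted(((k11, k12), (k21, k22))))] += c1 * c2
    return entry

# Row 6 is z_12; column 8 is the monomial p1 q1^3 p2^2 q2^2, i.e. (m1, m2) = (1, 2).
star = two_locus_entry(1, 2, 1, 2)
print(dict(star))
```

The nine coefficients obtained this way are 32 times the coefficients 1, 2, 4, 8, 1, 2, 1, 4, 1 appearing in $(*)$.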



5 Surfaces of degree 32 in the 8-dimensional simplex 

Let $\Delta_8$ denote the eight-dimensional probability simplex

\[
\Bigl\{\, (z_{00}, z_{01}, \ldots, z_{22}) \ :\ z_{ij} \geq 0 \ \text{ for } i, j \in \{0, 1, 2\} \ \text{ and } \ \sum_{i=0}^{2} \sum_{j=0}^{2} z_{ij} = 1 \,\Bigr\}.
\]

Likewise, we consider the product of two 1-simplices, which is the square

\[
\Delta_1 \times \Delta_1 \ =\ \{\, (p_1, q_1, p_2, q_2) \ :\ p_1, q_1, p_2, q_2 \geq 0 \ \text{ and } \ p_1 + q_1 = p_2 + q_2 = 1 \,\}.
\]

For fixed $F$, the formula $z^T = F \cdot \pi$ in Corollary 4 specifies a polynomial map

\[
F : \Delta_1 \times \Delta_1 \rightarrow \Delta_8 \quad \text{of bidegree } (4, 4).
\]

The image of the map $F$ is the two-locus model for fixed characteristics $f_{ij}$.
The model is a surface in the simplex $\Delta_8$. Our goal in this section is to express
this surface as the common zero set of a system of polynomials in the $z_{ij}$.

Theorem 5 For almost all characteristics $f_{ij}$, the two-locus model is a surface
of degree 32 in the simplex $\Delta_8$. This surface is the common zero set of the
degree 32 polynomials gotten by projection into three-dimensional subspaces.

Proof. We work in the setting of complex projective algebraic geometry. Consider
the embedding of the product of projective lines $P^1 \times P^1$ by the ample line
bundle $\mathcal{O}(4, 4)$. This is a toric surface $X$ of degree 32 in $P^{24}$. The $9 \times 25$-matrix
$F$ defines a rational map from $P^{24}$ to $P^8$, and it can be checked computationally
that this map has no base points on $X$ for general $f_{ij}$. Hence the image
$F(X)$ of $X$ in $P^8$ is a rational surface of degree 32. The two-locus model is
the intersection of $F(X)$ with $\Delta_8$, which is the positive orthant in $P^8$.

Let $A$ denote a generic $4 \times 9$-matrix, defining a rational map $P^8 \rightarrow P^3$. It
has no base points on $F(X)$, hence the image $AF(X)$ of $F(X)$ under $A$ is a
surface of degree 32 in projective 3-space $P^3$. The inverse image of $AF(X)$ in
$P^8$ is an irreducible hypersurface of degree 32 in $P^8$. It is defined by an irreducible
homogeneous polynomial of degree 32 in $z = (z_{00}, z_{01}, \ldots, z_{22})$. These
polynomials for various $4 \times 9$-matrices $A$ are known as the Chow equations
of the surface $F(X)$. Computing them is equivalent to computing the Chow
form of $F(X)$. A well-known construction in algebraic geometry (see e.g. [3,
§3.3]) shows that any irreducible projective variety is set-theoretically defined
by its Chow equations. Applying this result to $F(X)$ completes the proof. □
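
The degree 32 used in the proof can be recovered from a short intersection-number computation. The following is a sketch under the standard assumptions that $H_1$ and $H_2$ denote the two hyperplane classes on $P^1 \times P^1$, with $H_1^2 = H_2^2 = 0$ and $H_1 \cdot H_2 = 1$; this notation is introduced here and does not appear in the paper.

```latex
\begin{align*}
\deg X \;=\; (4H_1 + 4H_2)^2
       \;=\; 16\,H_1^2 + 32\,H_1 H_2 + 16\,H_2^2
       \;=\; 32 , \\
\deg \bigl( (P^1)^k \text{ embedded by } \mathcal{O}(4, \ldots, 4) \bigr)
       \;=\; (4H_1 + \cdots + 4H_k)^k
       \;=\; k!\, 4^k .
\end{align*}
```

For $k = 1$ this gives the degree 4 of the one-locus curve, and for $k = 2$ the degree 32 of the surface $X$.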

We now explain how Theorem 5 translates into an explicit algorithm for computing
the algebraic invariants of the two-locus model. Let $\mathcal{R}_X$ be the Chow
form of the toric surface $X \simeq P^1 \times P^1$ in $P^{24}$. The Chow form $\mathcal{R}_X$ is the
multigraded resultant of three polynomial equations of bidegree (4,4):

\[
\sum_{i=0}^{4} \sum_{j=0}^{4} \alpha_{ij}\, x^i y^j \ =\ \sum_{i=0}^{4} \sum_{j=0}^{4} \beta_{ij}\, x^i y^j \ =\ \sum_{i=0}^{4} \sum_{j=0}^{4} \gamma_{ij}\, x^i y^j \ =\ 0 .
\]

In concrete terms, $\mathcal{R}_X$ is the unique (up to sign) irreducible polynomial of
tridegree (32,32,32) in the 75 unknowns $\alpha_{ij}, \beta_{ij}, \gamma_{ij}$ which vanishes if and only if
the three equations have a common solution in $P^1 \times P^1$.

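We use the Bezout matrix representation of the resultant $\mathcal{R}_X$ given in [4,
Theorem 6.2]. This is a $32 \times 32$-matrix $B$ which is a direct generalization of
the $4 \times 4$-matrix in (3). Consider the $3 \times 25$-coefficient matrix

\[
\begin{pmatrix}
\alpha_{00} & \alpha_{01} & \alpha_{02} & \alpha_{03} & \alpha_{04} & \alpha_{10} & \alpha_{11} & \cdots & \alpha_{43} & \alpha_{44} \\
\beta_{00} & \beta_{01} & \beta_{02} & \beta_{03} & \beta_{04} & \beta_{10} & \beta_{11} & \cdots & \beta_{43} & \beta_{44} \\
\gamma_{00} & \gamma_{01} & \gamma_{02} & \gamma_{03} & \gamma_{04} & \gamma_{10} & \gamma_{11} & \cdots & \gamma_{43} & \gamma_{44}
\end{pmatrix}
\]

For $1 \leq i < j < k \leq 25$, let $[ijk]$ denote the determinant of the $3 \times 3$-submatrix
with column indices $i, j, k$. The entries in the Bezout matrix $B$ are
linear forms in the brackets $[ijk]$, and we have $\mathcal{R}_X = \det(B)$.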

Let $F$ be the $9 \times 25$-matrix in Corollary 4. We add the column vector $z$ to get
the $9 \times 26$-matrix $(F\ z)$. Next we pick any $4 \times 9$-matrix $A$ and we consider

\[
A \cdot (F\ z) \ =\ (A \cdot F \quad A \cdot z).
\]

This is a $4 \times 26$-matrix whose last column consists of linear forms in the $z_{ij}$.

In the Bezout matrix $B$, we now replace each bracket $[ijk]$ by the $4 \times 4$-subdeterminant
of $A \cdot (F\ z)$ with column indices $i, j, k$ and 26. Thus $[ijk]$
is a linear form in the $z_{ij}$ whose coefficients are homogeneous polynomials
of degree six in the $f_{ij}$. The matrix gotten by this substitution is denoted
$B(A \cdot (F\ z))$. Its determinant is the specialized resultant $\mathcal{R}_X(A \cdot (F\ z))$.

Corollary 6 The resultant $\mathcal{R}_X(A \cdot (F\ z))$ is a homogeneous polynomial of
degree 32 in the entries of $A$. Its coefficients are polynomials which are
bihomogeneous of degree 32 in the $z_{ij}$ and degree 192 in the $f_{ij}$. The two-locus
model is cut out by this finite list of coefficient polynomials in the $z_{ij}$ and $f_{ij}$.

Proof. Each entry of the $32 \times 32$-matrix $B(A \cdot (F\ z))$ is a polynomial which
is trihomogeneous of degree (1,6,1) in $(a_{ij}, f_{ij}, z_{ij})$. Hence its determinant is
trihomogeneous of degree (32, 192, 32). For fixed $A$ and fixed $F$, the resulting
polynomial defines a hypersurface of degree 32 in $P^8$. This hypersurface is
the inverse image of the surface $AF(X)$ in $P^3$. As discussed in the proof of
Theorem 5, our model is the intersection of these hypersurfaces for all possible
choices of $A$. A finite basis for the linear system of these hypersurfaces is given
by the coefficient polynomials of $\mathcal{R}_X(A \cdot (F\ z))$ with respect to $A$. □

The finite list of algebraic invariants described in the previous corollary is the
two-locus generalization of the one-locus invariant in Proposition 2. Note that
the bidegree in $(z, F)$ has now increased from (4, 8) to (32, 192). Our derivation
of these invariants from the Chow form of a Segre-Veronese variety generalizes
to the $k$-locus case, where $F$ and $z$ are $k$-dimensional tables of format
$3 \times 3 \times \cdots \times 3$. The analogous invariants have bidegree $(\, k!\, 4^k,\ 2 (k+1)!\, 4^k \,)$ in $(z, F)$.



6 Computational experiments and statistical perspectives 

We prepared a test implementation in maple of the elimination technique
described in the previous section. That code is available at the first author's
website www.stat.berkeley.edu/~ingileif/. The input is a triple
$((f_{ij}), (z_{ij}), A)$ consisting of a $3 \times 3$-matrix of model characteristics, a
$3 \times 3$-matrix of model coordinates, and a projection matrix of size $4 \times 9$. Each entry
in these input matrices can be either left symbolic or it can be specialized to a
number. Our program builds the specialized Bezout matrix $B(A \cdot (F\ z))$, and,
if the matrix entries are purely numeric, then it evaluates the determinant
$\mathcal{R}_X(A \cdot (F\ z))$.

Here are some examples of typical computations with our maple program. Set

\[
(z_{ij}) \ =\ \begin{pmatrix} 3 & 3 & 5 \\ 29 & 11 & 13 \\ 17 & 19 & 23 \end{pmatrix}
\qquad \text{and} \qquad
A \ =\ \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0
\end{pmatrix}.
\]

[The $3 \times 3$-matrix $(f_{ij})$ of this example is only partially recoverable; its surviving entries are 48, 39, 22.]

Then $B(A \cdot (F\ z))$ is a $32 \times 32$-matrix whose entries $b_{i,j}$ are integers, e.g.,

\[
b_{1,1} = 26967093018624, \quad b_{1,2} = -114552012275712, \quad \ldots, \quad b_{32,32} = 845647773696.
\]

The determinant of this $32 \times 32$-matrix is a non-zero integer with 469 digits:

\[
\mathcal{R}_X(A \cdot (F\ z)) \ =\ 0.2704985126\ldots \cdot 10^{469}.
\]

We now retain the numerical values for the model characteristics $f_{ij}$ and the
matrix $A$ from before but we make the model coordinates $z_{ij}$ indeterminates.
Then $B(A \cdot (F\ z))$ is a $32 \times 32$-matrix whose entries $b_{i,j}$ are linear forms

\[
\begin{aligned}
b_{1,1} \ &=\ -2630935904256\, z_{00} + 1315467952128\, z_{01} \\
          &\qquad + 1315467952128\, z_{10} - 657733976064\, z_{11}, \\
b_{1,2} \ &=\ 11746198683648\, z_{00} - 8211709034496\, z_{01} \\
          &\qquad - 5873099341824\, z_{10} + 4105854517248\, z_{11}, \quad \ldots
\end{aligned}
\]

Its determinant $\mathcal{R}_X(A \cdot (F\ z))$ is an irreducible polynomial of degree 32 which
vanishes on the model with the given characteristics $f_{ij}$. In fact, up to scaling,
it is the unique such polynomial which depends only on $z_{00}$, $z_{01}$, $z_{10}$ and $z_{11}$.
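
The symbolic and numeric versions of the example can be cross-checked against each other. The sketch below assumes the coordinate values $z_{00} = 3$, $z_{01} = 3$, $z_{10} = 29$, $z_{11} = 11$ from the first computation and evaluates the two displayed linear forms; the function names are illustrative.

```python
# Coordinate values assumed from the numerical example above.
z = {"00": 3, "01": 3, "10": 29, "11": 11}

def b11(z):
    """Linear form b_{1,1} with the coefficients quoted in the text."""
    return (-2630935904256 * z["00"] + 1315467952128 * z["01"]
            + 1315467952128 * z["10"] - 657733976064 * z["11"])

def b12(z):
    """Linear form b_{1,2} with the coefficients quoted in the text."""
    return (11746198683648 * z["00"] - 8211709034496 * z["01"]
            - 5873099341824 * z["10"] + 4105854517248 * z["11"])

print(b11(z), b12(z))
```

Both values agree with the integer entries $b_{1,1}$ and $b_{1,2}$ reported for the purely numeric run.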

Finally, we reverse the role of the coordinates $z_{ij}$ and the characteristics $f_{ij}$,
namely, we fix the former at their previous numerical values ($z_{00} = 3, \ldots, z_{22} = 23$)
but we regard the $f_{ij}$ as indeterminates. Then $B(A \cdot (F\ z))$ is a $32 \times 32$-matrix
whose entries $b_{i,j}$ are homogeneous polynomials of degree six, e.g.,

\[
\begin{aligned}
b_{1,1} \ =\ & 671744\, f_{00}^6 - 1343488\, f_{00}^5 f_{01} - 1343488\, f_{00}^5 f_{10} \\
& + 671744\, f_{00}^4 f_{01}^2 + 2686976\, f_{00}^4 f_{01} f_{10} + 671744\, f_{00}^4 f_{10}^2 \\
& - 1343488\, f_{00}^3 f_{01}^2 f_{10} - 1343488\, f_{00}^3 f_{01} f_{10}^2 + 671744\, f_{00}^2 f_{01}^2 f_{10}^2 .
\end{aligned}
\]

Now $\mathcal{R}_X(A \cdot (F\ z))$ is an irreducible homogeneous polynomial of degree 192
in the nine characteristics $f_{ij}$. The vanishing of this polynomial provides an
algebraic constraint on the set of all models $(f_{ij})$ which fit the given data $(z_{ij})$.

In linkage analysis, the characteristics $f_{ij}$ can take on any real value between
0 and 1. Two-locus models are often constructed by first picking two one-locus
characteristics, $g = (g_0, g_1, g_2)$ and $h = (h_0, h_1, h_2)$, from a class of special
models such as recessive or dominant. Then the two-locus model is defined by
combining the one-locus characteristics in one of the following ways:

multiplicative : $f_{ij} = g_i \cdot h_j$
heterogeneous : $f_{ij} = g_i + h_j - g_i \cdot h_j$
additive : $f_{ij} = g_i + h_j$
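
The three combination rules translate directly into code. The following is a minimal sketch; the function name `combine` and the sample one-locus characteristics are illustrative choices.

```python
from fractions import Fraction

RULES = {
    "multiplicative": lambda gi, hj: gi * hj,
    "heterogeneous": lambda gi, hj: gi + hj - gi * hj,
    "additive": lambda gi, hj: gi + hj,
}

def combine(g, h, rule):
    """3 x 3 table (f_ij) built from one-locus characteristics g and h."""
    op = RULES[rule]
    return [[op(gi, hj) for hj in h] for gi in g]

half = Fraction(1, 2)
g = (0, 0, half)          # a recessive one-locus model
h = (0, half, half)       # a dominant one-locus model

tables = {rule: combine(g, h, rule) for rule in RULES}
for rule, table in tables.items():
    print(rule, table)
```

In each table, row $i$ is indexed by the number of disease alleles at the first locus and column $j$ by the number at the second locus.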

The $9 \times 25$-matrix $F$ of the multiplicative model is the tensor product of the
two $3 \times 5$-matrices gotten from $g$ and $h$ as in Proposition 1. Hence the surface
of the multiplicative model is the Segre product of two one-locus curves. The
heterogeneous model and the additive model are too special, in the sense that
the corresponding surfaces in $P^8$ have degree less than 32. In these two cases,
the resultant $\mathcal{R}_X(A \cdot (F\ z))$ vanishes identically, and our maple code always
outputs zero. The surfaces arising from these two models require a separate
algebraic study. Conducting this study could be a worthwhile next step.

The following two-locus analogue to Holmans' triangle (the smaller triangle
in Figure 2) was derived in [1]. For affected sibling pairs the IBD sharing
probabilities $z = (z_{00}, z_{01}, \ldots, z_{22})$ satisfy $H \cdot z^T \geq 0$, where $H$ is the inverse
of $K^{\otimes 2}$ and $K$ is $\tfrac{1}{4}$ times a $3 \times 3$-matrix [entries not recoverable here].

So, in practical applications we are only interested in the intersection of our
degree 32 surface with the 8-simplex defined by these linear inequalities.

In summary, in this paper we have presented a model for the sharing of genetic
material of two affected siblings, used in genetic linkage analysis, in the framework
of algebraic geometry. The model is rich in structure, but this structure
is not yet fully exploited in statistical tests for genetic linkage. For plausible
biological models we expect to see increased sharing between affected sibling
pairs at gene loci linked to the disease. The null hypothesis for linkage is rejected
only if the estimate of the model coordinates, $z$, differs significantly from
$z_{\mathrm{null}}$. This is a geometric statement about the distance between two points in
a triangle (for $k = 1$) or in an 8-simplex (for $k = 2$). We believe that the
algebraic representation of the model derived here will be useful for deriving
new test statistics for linkage in the case when $k \geq 2$.